# Use SGLang as backend

Lighteval allows you to use `sglang` as backend allowing great speedups.
To use, simply change the `model_args` to reflect the arguments you want to pass to sglang.

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0|0"
```

`sglang` is able to distribute the model across multiple GPUs using data
parallelism and tensor parallelism.
You can choose the parallelism method by setting in the `model_args`.

For example if you have 4 GPUs you can split it across using `tp_size`:

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=4" \
    "leaderboard|truthfulqa:mc|0|0"
```

Or, if your model fits on a single GPU, you can use `dp_size` to speed up the evaluation:

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,dp_size=4" \
    "leaderboard|truthfulqa:mc|0|0"
```

## Use a config file

For more advanced configurations, you can use a config file for the model.
An example of a config file is shown below and can be found at `examples/model_configs/sglang_model_config.yaml`.

```bash
lighteval sglang \
    "examples/model_configs/sglang_model_config.yaml" \
    "leaderboard|truthfulqa:mc|0|0"
```

> [!TIP]
> Documentation for the config file of sglang can be found [here](https://docs.sglang.ai/backend/server_arguments.html)

```yaml
model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    dtype: "auto"
    tp_size: 1
    dp_size: 1
    context_length: null
    random_seed: 1
    trust_remote_code: False
    use_chat_template: False
    device: "cuda"
    skip_tokenizer_init: False
    kv_cache_dtype: "auto"
    add_special_tokens: True
    pairwise_tokenization: False
    sampling_backend: null
    attention_backend: null
    mem_fraction_static: 0.8
    chunked_prefill_size: 4096
    generation_parameters:
      max_new_tokens: 1024
      min_new_tokens: 0
      temperature: 1.0
      top_k: 50
      min_p: 0.0
      top_p: 1.0
      presence_penalty: 0.0
      repetition_penalty: 1.0
      frequency_penalty: 0.0
```

> [!WARNING]
> In the case of OOM issues, you might need to reduce the context size of the
> model as well as reduce the `mem_fraction_static` and `chunked_prefill_size` parameter.
